Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client/fingerprint/cpu: use fallback total compute value if cpu not detected #9589

Merged
merged 1 commit into from
Dec 9, 2020

Conversation

shoenig
Copy link
Contributor

@shoenig shoenig commented Dec 9, 2020

Previously, Nomad would fail to startup if the CPU fingerprinter could
not detect the cpu total compute (i.e. cores * mhz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.

Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the ignored cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting the setting of
cpu_total_compute in client config.

Fixes #7989

…etected

Previously, Nomad would fail to startup if the CPU fingerprinter could
not detect the cpu total compute (i.e. cores * mhz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.

Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the useless cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting the setting of
cpu_total_compute in client config.

Fixes #7989
@shoenig
Copy link
Contributor Author

shoenig commented Dec 9, 2020

Quick spot testing on a t4gsmall:

env_aws enabled

2020-12-09T16:24:47.024Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=2
2020-12-09T16:24:47.024Z [INFO]  client.fingerprint_mgr.cpu: fallback to default cpu total compute, set client config option cpu_total_compute to override
    
2020-12-09T16:24:47.041Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu: cores=2 ghz=2.5
2020-12-09T16:24:47.041Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu: ticks=5000

Allocated Resources
CPU         Memory       Disk
0/5000 MHz  0 B/1.8 GiB  0 B/4.3 GiB

env_aws disabled

2020-12-09T16:30:51.484Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=2
2020-12-09T16:30:51.484Z [INFO]  client.fingerprint_mgr.cpu: fallback to default cpu total compute, set client config option cpu_total_compute to override
    
2020-12-09T16:30:51.488Z [DEBUG] client.fingerprint_mgr: fingerprint modules skipped due to allow/denylist: skipped_fingerprinters=[env_aws]

Allocated Resources
CPU         Memory       Disk
0/1000 MHz  0 B/1.8 GiB  0 B/4.3 GiB

@shoenig shoenig requested review from notnoop and tgross December 9, 2020 16:52
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This seems like a much nicer user experience.

return fmt.Errorf("cannot detect cpu total compute. "+
"CPU compute must be set manually using the client config option %q",
"cpu_total_compute")
f.logger.Info("fallback to default cpu total compute, set client config option cpu_total_compute to override")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log line is a little unfortunate in that it's important to see in non-AWS cases but in AWS cases we're going to fix it in the fingerprinter, which logs at debug. But I don't see a good way around that other than dropping this one to debug... which I think would be a bad idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I could be convinced to pick any of the possible combinations of Info/Debug. Realistically the most correct solution to all of this would involve plumbing to enable fingerprinters to coordinate. But for this, that would be like driving a thumbtack with sledgehammer.

@shoenig shoenig merged commit c1f1b15 into master Dec 9, 2020
@shoenig shoenig deleted the f-aws-graviton branch December 9, 2020 17:19
@github-actions
Copy link

github-actions bot commented Dec 6, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad unable to launch on EC2 graviton instance types
2 participants